Algorithms and tools for the analysis of high throughput DNA sequencing data

Author

  • Marcel Martin

Abstract

High-throughput DNA sequencing technologies make it possible to determine the order of the nucleotides adenine, cytosine, guanine and thymine in DNA samples, resulting in millions of short strings (reads) over the alphabet {A, C, G, T}. Advances in biological and biomedical research rely on the ability of bioinformatics to make sense of that data with novel algorithms and tools. In this thesis, we contribute on four levels to the typical data processing pipeline in sequencing experiments and provide software tools that implement the described algorithms. When sequenced DNA fragments are short, reads can contain adapter sequences. These artifacts are a technical requirement of the sequencing process. We describe how to remove them with a modified semiglobal alignment algorithm that finds overlapping regions between read and adapter. The algorithm is designed to find only alignments below a given error rate threshold, where the error rate is defined as the number of errors divided by the number of aligned adapter characters. We show how to use only linear space while still keeping track of all information necessary to correctly locate and remove adapter sequences. The algorithm can also remove adapters from colorspace reads, which come from a sequencing technology that queries two adjacent nucleotides (colors) of DNA at the same time. We show how to modify the trimming procedure to get correct results. The easy-to-use cutadapt tool is introduced. It contains additional features that make pre-processing of adapter-contaminated reads simple, and is in use by many other researchers. The next step in the pipeline is read mapping, where the likely origin of reads is found on a given reference DNA. We concentrate on mapping reads from bisulfite sequencing experiments, in which sodium bisulfite is used to determine which cytosines have a methyl group attached to them. Methylation changes gene expression and is therefore biologically interesting.
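The adapter-trimming step described above can be illustrated with a minimal sketch. It is not cutadapt's implementation: the function name `trim_3prime_adapter`, the parameters `max_error_rate` and `min_overlap`, the unit edit costs, and the rule for choosing among several acceptable alignments (aligned adapter length minus errors) are all assumptions made for illustration. The sketch keeps only two columns of the dynamic-programming table at a time, so it uses space linear in the adapter length, and it stores for each cell the read position where the alignment began so that the adapter can be cut off afterwards.

```python
def trim_3prime_adapter(read, adapter, max_error_rate=0.1, min_overlap=3):
    """Remove a 3' adapter (full, or partial at the read end) from `read`.

    Illustrative sketch only: unit-cost edits, hypothetical scoring rule.
    """
    m = len(adapter)
    # Column for read prefix length 0: adapter[:i] against nothing costs i.
    cost = list(range(m + 1))
    start = [0] * (m + 1)  # read position where each cell's alignment begins
    best = None            # (score, read_start) of the best accepted alignment

    def consider(i, errors, read_start):
        # Accept only if errors / aligned adapter characters <= threshold.
        nonlocal best
        if i < min_overlap or errors > max_error_rate * i:
            return
        score = i - errors  # assumed ranking: long overlaps, few errors
        if best is None or score > best[0]:
            best = (score, read_start)

    for j in range(1, len(read) + 1):
        prev_cost, prev_start = cost, start
        # Row 0 costs nothing: the alignment may begin anywhere in the read.
        cost = [0] * (m + 1)
        start = [j] * (m + 1)
        for i in range(1, m + 1):
            mismatch = adapter[i - 1] != read[j - 1]
            c, s = prev_cost[i - 1] + mismatch, prev_start[i - 1]
            if prev_cost[i] + 1 < c:      # extra character in the read
                c, s = prev_cost[i] + 1, prev_start[i]
            if cost[i - 1] + 1 < c:       # adapter character missing in read
                c, s = cost[i - 1] + 1, start[i - 1]
            cost[i], start[i] = c, s
        consider(m, cost[m], start[m])    # full adapter ending at position j
    for i in range(1, m):                 # adapter prefix running off the read end
        consider(i, cost[i], start[i])
    return read if best is None else read[:best[1]]
```

With the illustrative adapter `AGATCGGAAGAGC`, the call `trim_3prime_adapter("TTGACCTGAA" + "AGATCG", "AGATCGGAAGAGC")` returns `"TTGACCTGAA"`, because the six-character adapter prefix at the read end is aligned with zero errors.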
Bisulfite converts unmethylated cytosines into thymines. By comparing modified reads to the reference, methylation patterns can be determined. To map reads while allowing sequencing errors and also differences from bisulfite conversion, we introduce the bisulfite q-gram index, an extension of regular q-gram indices. For a given q-gram (string of length q), the index returns all positions in the reference where that bisulfite-converted q-gram may have originated. By efficiently simulating bisulfite conversion of the reference, the index can be constructed in time proportional to its memory usage. Simulation theoretically leads to an exponential increase in index size, but on realistic references the index is only three times the size of a regular index. We describe how to map reads with the index using the seed-and-extend paradigm, first finding short matches with the help of the index, and then extending them to longer maximal error-free matches (seeds) with either a deterministic finite automaton (DFA) or an efficient bit-parallel algorithm. Seeds are then extended to an alignment that covers the full read, and parts that were not bisulfite converted are detected. We show that the number of bisulfite strings of a given length n is approximately 1.19 · 3.3^n, and we show how to compress the index by up to 25% while retaining efficient access. We finally apply the full read mapping algorithm to a dataset of 454 bisulfite sequencing data using the Verjinxer tool.
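The idea behind the bisulfite q-gram index can be sketched naively as follows. This is an assumption-laden illustration, not the data structure from the thesis or from Verjinxer: the function names `bisulfite_variants` and `build_bisulfite_index` are hypothetical, only C-to-T conversion on the forward strand is simulated, and every converted variant is enumerated explicitly, which is exactly the exponential blow-up that the efficient construction mentioned above is meant to avoid.

```python
from collections import defaultdict
from itertools import product


def bisulfite_variants(qgram):
    """All strings obtainable from `qgram` by converting any subset of C to T."""
    options = [("C", "T") if ch == "C" else (ch,) for ch in qgram]
    return {"".join(chars) for chars in product(*options)}


def build_bisulfite_index(reference, q):
    """Map each possibly converted q-gram to the reference positions it may come from.

    Naive sketch: forward strand only, explicit enumeration of variants.
    """
    index = defaultdict(list)
    for pos in range(len(reference) - q + 1):
        for variant in bisulfite_variants(reference[pos:pos + q]):
            index[variant].append(pos)
    return dict(index)
```

For the toy reference `CATTAT` and q = 3, looking up the read q-gram `TAT` yields positions 0 and 3: position 3 matches directly, while position 0 holds `CAT`, whose cytosine may have been converted to thymine.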

Similar resources

A review of DNA sequencing techniques (first, second, and third generation)

Introduction: DNA sequencing is the most important technique in molecular biology for identifying the order of the nucleotides in a piece of DNA. There are several different methods for sequencing DNA. DNA sequencing is now of great importance in medical diagnostics and other medical fields. Some methods have been invented to speed up and increase the efficiency of the...

Strategies and Clinical Applications of Next Generation Sequencing

Abstract: DNA sequencing is one of the most valuable techniques in molecular biology and can be used to detect the sequence of nucleotides in a DNA fragment. High-throughput sequencing, known as Next Generation Sequencing (NGS), has revolutionized genomic research and molecular biology; as a result, the whole human genome can be sequenced at low cost in several days. NGS technology is simi...

Application of DNA Molecular Markers in Plant Breeding (Review article)

Plant breeding has utilized a wide range of techniques and methods to improve the quality and quantity of plants. Molecular markers are tools that have provided a new perspective for plant breeding advancements. This article has reviewed the various advantages and uses of molecular markers and the utilization of the high potential of natural polymorphisms within communities, combined wi...

Cloning and sequencing of Toxoplasma gondii major surface antigen (SAG1) gene

Genetic typing methods for T. gondii strains have been extensively perfected in recent years. From a technical point of view, many tools usable for genetic studies on single-copy loci have been used: RFLP, PCR-RFLP, sequencing, RAPD-PCR and isoenzyme analysis. We describe the cloning and sequence analysis of the gene which encodes the major surface antigen (SAG1 or P30) of T. gondii. SAG1 is ...

gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...

Publication date: 2014